Overview

In this homework assignment, we will explore, analyze and model a dataset of approximately 8,000 records, each representing a customer of an auto insurance company. We will build multiple linear regression models on the continuous variable TARGET_AMT and binary logistic regression models on the binary variable TARGET_FLAG, to predict the probability that a person will crash their car and the associated cost if they do.

We are going to build several models using different techniques and variable-selection approaches. To best assess our predictive models, we will carve a validation set out of our training data using an 80/20 training/validation split, before applying the finalized models to a separate evaluation dataset that does not contain the targets.

1. Data Exploration

The insurance training dataset contains 8,161 observations of 26 variables; each record represents a customer of an auto insurance company. The evaluation dataset contains 2,141 observations of 26 variables. The variables include demographic measures such as age and gender, socioeconomic measures such as education and household income, and vehicle-specific measures such as car type, age and assessed value.

Each record also has two response variables. The first, TARGET_FLAG, is binary: "1" means the person was in a car crash, "0" means they were not. The second, TARGET_AMT, is numeric, indicating the (positive) cost if a crash occurred; it is zero if the person did not crash their car.

We can explore a sample of the training data here, and make some initial observations:

  • Some variables are stored as character though they should be numeric, and vice versa.
  • Currency variables are stored as strings with "$" symbols rather than as numerics.
  • Some character values include a z_ prefix that could be removed for readability.

1.1 Summary Statistics

The table below provides valuable descriptive statistics about the training data:

Data summary
Name train_df
Number of rows 8161
Number of columns 25
_______________________
Column type frequency:
character 10
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
PARENT1 0 1 2 3 0 2 0
MSTATUS 0 1 2 3 0 2 0
SEX 0 1 1 1 0 2 0
EDUCATION 0 1 3 12 0 5 0
JOB 0 1 6 12 0 9 0
CAR_USE 0 1 7 10 0 2 0
CAR_TYPE 0 1 3 11 0 6 0
RED_CAR 0 1 2 3 0 2 0
REVOKED 0 1 2 3 0 2 0
URBANICITY 0 1 19 19 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0
INCOME 445 0.95 61898.09 47572.68 0 28097 54028 85986 367030.0
HOME_VAL 464 0.94 154867.29 129123.77 0 0 161160 238724 885282.0
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0
BLUEBOOK 0 1.00 15709.90 8419.73 1500 9280 14440 20850 69740.0
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0
OLDCLAIM 0 1.00 4037.08 8777.14 0 0 0 4636 57037.0
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0


Based on this summary table and exploration of the data, we can make the following observations:

  • Of the summarized variables, 10 are character and 15 are numeric (several numeric variables, such as TARGET_FLAG and CLM_FREQ, are effectively categorical or counts).
  • There is no missing data among the character variables.
  • Numeric variables with missing values include YOJ (6%), INCOME (5%), HOME_VAL (6%), CAR_AGE (6%), and AGE (<1%).
  • Most of the numeric variables have a minimum of zero.
  • One variable, CAR_AGE, has a minimum of -3, which doesn’t make intuitive sense.

1.2 Distributions

Before building a model, we need to check how the two classes are represented in our TARGET_FLAG variable. Class 1 accounts for about 26% of observations and class 0 for about 74%. This class imbalance is something we have to account for, and we may need additional steps (resampling, bootstrapping, etc.) before using logistic regression.

Distribution of Target Flag
Value %
0 0.74
1 0.26

Many of these distributions seem highly skewed and non-normal. As part of our data preparation we’ll use power transformations to find whether transforming variables to more normal distributions improves our models’ efficacy.

1.3 Box Plots

Commentary

1.4 Scatter Plot

Interestingly, none of our predictors appear to have strong linear relationships to our TARGET_AMT response variable, which is a primary assumption of linear regression. This suggests that alternative methods might be more successful in modeling the relationships.

1.5 Correlation Matrix

Commentary

2. Data preparation

2.1 Data types

In order to work with our training dataset, we’ll need to first convert some variables to more useful data types:

  • Convert currency columns from character to numeric: INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM.
  • Convert character columns to factors: TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS, PARENT1, RED_CAR, REVOKED, SEX and URBANICITY.
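As a sketch, the conversions might look like this in base R, using a hypothetical two-row sample (a dplyr/readr pipeline, e.g. readr::parse_number(), would work equally well):

```r
# Hypothetical two-row sample illustrating the conversions
train_sample <- data.frame(
  INCOME      = c("$67,349", "$0"),
  TARGET_FLAG = c("1", "0"),
  stringsAsFactors = FALSE
)

# Strip "$" and "," then coerce currency strings to numeric
currency_to_numeric <- function(x) as.numeric(gsub("[$,]", "", x))

currency_cols <- intersect(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"),
                           names(train_sample))
factor_cols   <- intersect(c("TARGET_FLAG", "CAR_TYPE", "CAR_USE", "EDUCATION",
                             "JOB", "MSTATUS", "PARENT1", "RED_CAR", "REVOKED",
                             "SEX", "URBANICITY"), names(train_sample))

train_sample[currency_cols] <- lapply(train_sample[currency_cols], currency_to_numeric)
train_sample[factor_cols]   <- lapply(train_sample[factor_cols], as.factor)
```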

2.3 Transformations and Missing Values

Before we go further, we need to identify and handle any missing (NA) or negative data values so that we can perform log transformations and regression.

First, we’ll apply transformations to clean up and align formatting of our variables:

  • Drop the INDEX variable.
  • Remove “z_” from all character class values.
  • Update RED_CAR, replace [no,yes] values with [No, Yes] values.
  • Replace JOB blank values with ‘Unknown’.
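These cleanup steps can be sketched in base R on a few hypothetical rows:

```r
# Hypothetical rows illustrating the formatting cleanup
df <- data.frame(
  INDEX     = 1:3,
  EDUCATION = c("z_High School", "Bachelors", "z_High School"),
  RED_CAR   = c("no", "yes", "no"),
  JOB       = c("", "Manager", "Clerical"),
  stringsAsFactors = FALSE
)

df$INDEX <- NULL                                                 # drop the INDEX variable
is_chr   <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) sub("^z_", "", x))  # remove "z_" prefix
df$RED_CAR <- ifelse(df$RED_CAR == "yes", "Yes", "No")           # [no, yes] -> [No, Yes]
df$JOB[df$JOB == ""] <- "Unknown"                                # blank JOB -> "Unknown"
```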

Next, we’ll manually adjust two special cases of missing or outlier values.

  • In cases where YOJ is zero and INCOME is NA, we’ll set INCOME to zero to avoid imputing new values over legitimate instances of non-employment.
  • There is also at least one CAR_AGE value below zero; we’ll assume this is a data collection error and set it to zero (representing a brand-new car).
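Both adjustments are simple conditional replacements; a sketch on hypothetical rows:

```r
# Hypothetical rows: a non-employed customer with NA income,
# and a car with an impossible negative age
df <- data.frame(YOJ = c(0, 12), INCOME = c(NA, 54028), CAR_AGE = c(-3, 8))

df$INCOME[df$YOJ == 0 & is.na(df$INCOME)] <- 0  # non-employment, not missingness
df$CAR_AGE[df$CAR_AGE < 0] <- 0                 # assumed data-entry error
```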

We’ll use MICE (Multivariate Imputation by Chained Equations) to impute our remaining variables with missing values - AGE, YOJ, CAR_AGE, INCOME and HOME_VAL:

  • We might reasonably assume that relationships exist between these variables (older, more years on the job may correlate with higher income and home value). Taking simple means or medians might suppress those features, but MICE should provide a better imputation.
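A minimal sketch of the imputation, assuming the mice package; the method (predictive mean matching) and seed here are illustrative defaults, not necessarily this report's exact settings, and a toy frame stands in for the training columns:

```r
library(mice)

# Toy frame with scattered NAs standing in for the training columns
df <- data.frame(
  AGE    = c(44, NA, 51, 39, 60, 35, 48, NA, 52, 41),
  YOJ    = c(10, 12, NA, 9, 14, 11, NA, 8, 13, 10),
  INCOME = c(28097, 54028, 85986, NA, 61898, 45000, 72000, 39000, NA, 50000)
)

# Chained equations: each incomplete variable is modeled on the others
imp <- mice(df, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
df_complete <- complete(imp)  # first completed dataset; no NAs remain
```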

Next we’ll consider power transformations for variables with skewed distributions. For example, our numeric response variable TARGET_AMT is a good candidate for transformation: its distribution is very highly skewed, and linear regression works best when the resulting residuals are approximately normal.

  • A log transformation will be applied to INCOME, TARGET_AMT and OLDCLAIM to bring their right-skewed distributions closer to normal.
  • Similarly, a Box-Cox transformation will be applied to BLUEBOOK, TRAVTIME and TIF so they are also more normally distributed.
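A sketch of the log step; the -4.61 minimum of the transformed TARGET_AMT in the summary table is consistent with a log(x + 0.01) shift, so the 0.01 offset below is our inference, used to guard log(0):

```r
# Log transform with a small offset to handle zeros
# (offset inferred from the -4.61 = log(0.01) minimum in the summary)
log_shift <- function(x, eps = 0.01) log(x + eps)

target_amt <- c(0, 1036, 107586)
log_shift(target_amt)

# For the Box-Cox step, MASS::boxcox() (MASS ships with base R) can
# suggest a lambda for BLUEBOOK, TRAVTIME and TIF
```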

To give our models more variables to work with, we’ll engineer some additional features:

  • Create bin values for CAR_AGE, HOME_VAL and TIF.
  • Create dummy variables for two-level factors, MALE, MARRIED, LIC_REVOKED, CAR_RED, PRIVATE_USE, SINGLE_PARENT and URBAN.
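The feature engineering can be sketched with cut() for bins and ifelse() for two-level dummies; the bin boundaries below are hypothetical placeholders, not this report's exact cut points:

```r
# Hypothetical rows and bin boundaries
df <- data.frame(CAR_AGE = c(0, 5, 10, 20),
                 SEX     = c("M", "F", "M", "F"),
                 stringsAsFactors = FALSE)

df$CAR_AGE_BIN <- as.integer(cut(df$CAR_AGE,
                                 breaks = c(-Inf, 1, 8, 12, Inf)))  # 4 bins
df$MALE <- factor(ifelse(df$SEX == "M", 1, 0))                      # two-level dummy
```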

We can examine our final, transformed training dataset and distributions below (with a temporary numeric variable, CAR_CRASH, created to represent the crash outcome).

Data summary
Name train_df
Number of rows 8161
Number of columns 28
_______________________
Column type frequency:
factor 11
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
TARGET_FLAG 0 1 FALSE 2 0: 6008, 1: 2153
EDUCATION 0 1 FALSE 5 Hig: 2330, Bac: 2242, Mas: 1658, <Hi: 1203
JOB 0 1 FALSE 9 Blu: 1825, Cle: 1271, Pro: 1117, Man: 988
CAR_TYPE 0 1 FALSE 6 SUV: 2294, Min: 2145, Pic: 1389, Spo: 907
MALE 0 1 FALSE 2 0: 4375, 1: 3786
MARRIED 0 1 FALSE 2 1: 4894, 0: 3267
LIC_REVOKED 0 1 FALSE 2 0: 7161, 1: 1000
CAR_RED 0 1 FALSE 2 0: 5783, 1: 2378
PRIVATE_USE 0 1 FALSE 2 1: 5132, 0: 3029
SINGLE_PARENT 0 1 FALSE 2 0: 7084, 1: 1077
URBAN 0 1 FALSE 2 1: 6492, 0: 1669

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
TARGET_AMT 0 1 -1.21 5.69 -4.61 -4.61 -4.61 6.94 11.59
KIDSDRIV 0 1 0.17 0.51 0.00 0.00 0.00 0.00 4.00
AGE 0 1 44.78 8.63 16.00 39.00 45.00 51.00 81.00
HOMEKIDS 0 1 0.72 1.12 0.00 0.00 0.00 1.00 5.00
YOJ 0 1 10.48 4.10 0.00 9.00 11.00 13.00 23.00
INCOME 0 1 61461.41 47434.08 0.00 27684.00 53483.00 85479.00 367030.00
HOME_VAL 0 1 154644.85 129340.41 0.00 0.00 160874.00 238349.00 885282.00
TRAVTIME 0 1 15.08 5.82 3.00 11.17 15.34 19.12 45.61
BLUEBOOK 0 1 182.32 48.72 62.21 147.95 182.18 216.49 381.01
TIF 0 1 1.63 1.25 0.00 0.00 1.62 2.43 4.70
OLDCLAIM 0 1 0.56 6.54 -4.61 -4.61 -4.61 8.44 10.95
CLM_FREQ 0 1 0.80 1.16 0.00 0.00 0.00 2.00 5.00
MVR_PTS 0 1 1.70 2.15 0.00 0.00 1.00 3.00 13.00
CAR_AGE 0 1 8.34 5.70 0.00 1.00 8.00 12.00 28.00
CAR_AGE_BIN 0 1 2.48 1.12 1.00 1.00 2.00 3.00 4.00
HOME_VAL_BIN 0 1 2.45 1.16 1.00 1.00 2.00 3.00 4.00
TIF_BIN 0 1 2.41 1.16 1.00 1.00 2.00 3.00 4.00

2.4 Visualizations

We can use Mosaic Plots to illustrate the relationship of binary factor variables to TARGET_FLAG:

  • Observation
  • Observation
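Base R's mosaicplot() over a contingency table is one way to draw these; a sketch on toy data (this report's plots use the prepared training frame):

```r
# Toy two-level factors standing in for the prepared training data
df <- data.frame(TARGET_FLAG = factor(c(0, 1, 0, 1, 0, 0, 1, 0)),
                 URBAN       = factor(c(1, 1, 0, 1, 0, 1, 1, 0)))

tab <- table(df$URBAN, df$TARGET_FLAG)  # rows: URBAN, cols: TARGET_FLAG
mosaicplot(tab, main = "TARGET_FLAG by URBAN",
           xlab = "URBAN", ylab = "TARGET_FLAG", color = TRUE)
```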

We can also use Mosaic Plots to illustrate the relationship of multi-level factor variables to TARGET_FLAG:

  • Observation
  • Observation

2.5 Training and Validation Sets

To proceed with modeling, we’ll split our training data into train (80%) and validation (20%) datasets.
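A base-R sketch of the split; caret::createDataPartition() or rsample::initial_split() would additionally allow stratifying on TARGET_FLAG, which matters given the class imbalance:

```r
set.seed(123)                         # reproducible split
n         <- 8161                     # rows in the training data
train_idx <- sample(seq_len(n), size = floor(0.80 * n))

# train_set <- train_df[train_idx, ]; valid_set <- train_df[-train_idx, ]
length(train_idx) / n                 # ~0.80
```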

3. Multiple linear regression

3.1 Model 1 - Lasso
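The cv.glmnet call recorded in the output below (5-fold CV, Gaussian family, alpha = 1 for the lasso penalty) can be sketched as follows, assuming the glmnet package; the built-in mtcars data stands in here for our prepared design matrix X and transformed response Y:

```r
library(glmnet)

# mtcars stands in for the prepared training data
X <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # drop intercept column
Y <- mtcars$mpg

set.seed(123)
fit <- cv.glmnet(x = X, y = Y, nfolds = 5, family = "gaussian",
                 standardize = TRUE, alpha = 1)   # alpha = 1 -> lasso

coef(fit, s = "lambda.min")  # coefficients at the CV-MSE-minimizing lambda
coef(fit, s = "lambda.1se")  # sparser model within one SE of the minimum
```

The two coefficient matrices in the output correspond to these two lambda choices: lambda.min keeps more predictors, lambda.1se zeroes out more of them.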

## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.450670e+00
## KIDSDRIV              8.242256e-01
## AGE                  -1.590170e-03
## HOMEKIDS              2.966269e-02
## YOJ                  -1.843188e-02
## INCOME               -6.032096e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -6.083644e-01
## EDUCATIONHigh School  6.004131e-02
## EDUCATIONMasters     -4.377078e-01
## EDUCATIONPhD         -1.234115e-01
## JOBClerical           1.422133e-01
## JOBDoctor            -9.728426e-01
## JOBHome Maker         .           
## JOBLawyer            -1.137389e-01
## JOBManager           -1.529192e+00
## JOBProfessional      -1.013136e-01
## JOBStudent           -1.994015e-01
## JOBUnknown           -6.904978e-01
## TRAVTIME              6.940606e-02
## BLUEBOOK             -7.084013e-03
## TIF                  -1.700780e-01
## CAR_TYPEPanel Truck   6.532344e-01
## CAR_TYPEPickup        7.223041e-01
## CAR_TYPESports Car    1.591797e+00
## CAR_TYPESUV           1.135741e+00
## CAR_TYPEVan           7.850491e-01
## OLDCLAIM              4.862112e-02
## CLM_FREQ              7.523577e-02
## MVR_PTS               2.709291e-01
## CAR_AGE              -7.336134e-03
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -3.339554e-01
## TIF_BIN              -1.534265e-01
## MALE1                 4.100768e-02
## MARRIED1             -8.001639e-01
## LIC_REVOKED1          1.591403e+00
## CAR_RED1              3.091972e-02
## PRIVATE_USE1         -1.677192e+00
## SINGLE_PARENT1        1.042326e+00
## URBAN1                3.650730e+00
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.123592e+00
## KIDSDRIV              5.781403e-01
## AGE                   .           
## HOMEKIDS              .           
## YOJ                   .           
## INCOME               -6.145362e-06
## HOME_VAL              .           
## EDUCATIONBachelors    .           
## EDUCATIONHigh School  2.565739e-01
## EDUCATIONMasters      .           
## EDUCATIONPhD          .           
## JOBClerical           .           
## JOBDoctor             .           
## JOBHome Maker         .           
## JOBLawyer             .           
## JOBManager           -1.060109e+00
## JOBProfessional       .           
## JOBStudent            .           
## JOBUnknown            .           
## TRAVTIME              4.069176e-02
## BLUEBOOK             -5.715392e-03
## TIF                  -1.354175e-01
## CAR_TYPEPanel Truck   .           
## CAR_TYPEPickup        .           
## CAR_TYPESports Car    5.614647e-01
## CAR_TYPESUV           2.868858e-01
## CAR_TYPEVan           .           
## OLDCLAIM              6.328528e-02
## CLM_FREQ              1.384936e-03
## MVR_PTS               2.502005e-01
## CAR_AGE              -1.463889e-02
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -3.804650e-01
## TIF_BIN              -6.383520e-02
## MALE1                 .           
## MARRIED1             -4.457439e-01
## LIC_REVOKED1          1.321738e+00
## CAR_RED1              .           
## PRIVATE_USE1         -1.384149e+00
## SINGLE_PARENT1        1.010721e+00
## URBAN1                2.952228e+00
## 
## Call:  cv.glmnet(x = X, y = Y, nfolds = 5, family = "gaussian", standardize = TRUE,      alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure     SE Nonzero
## min 0.01331    51   25.33 0.4037      37
## 1se 0.16407    24   25.73 0.4060      20

## $mse
## lambda.1se 
##   25.50668 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.1se 
##   4.166593 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -48241.06
## 
## $BIC
## [1] -47990.49
## $AICc
## [1] -44758.86
## 
## $BIC
## [1] -44623.31

Model Performance

## # A tibble: 5 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mape    standard      70.5  
## 2 smape   standard     110.   
## 3 mase    standard       0.776
## 4 mpe     standard      66.1  
## 5 rmse    standard       5.03

3.2 Model 2 -

3.3 Model 3 -

3.4 Model selection

4. Binary logistic regression

4.1 Model 1 - Lasso
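The binomial counterpart to the linear lasso can be sketched the same way (assumes the glmnet package; mtcars$am, a 0/1 variable, stands in for TARGET_FLAG):

```r
library(glmnet)

# mtcars$am (0/1 transmission type) stands in for TARGET_FLAG
X <- model.matrix(am ~ ., data = mtcars)[, -1]
Y <- mtcars$am

set.seed(123)
fit <- cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial",
                 standardize = TRUE, alpha = 1)   # lasso-penalized logistic

predict(fit, newx = X, s = "lambda.min", type = "response")  # predicted probabilities
```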

## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.716098e+00
## KIDSDRIV              4.114268e-01
## AGE                  -1.274043e-03
## HOMEKIDS              2.779506e-02
## YOJ                  -5.912864e-03
## INCOME               -3.964779e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -3.129844e-01
## EDUCATIONHigh School  1.563180e-02
## EDUCATIONMasters     -2.467550e-01
## EDUCATIONPhD         -4.260311e-02
## JOBClerical           6.781448e-02
## JOBDoctor            -4.680465e-01
## JOBHome Maker        -2.933138e-02
## JOBLawyer             .           
## JOBManager           -7.506532e-01
## JOBProfessional      -3.377667e-02
## JOBStudent           -1.395579e-01
## JOBUnknown           -2.049771e-01
## TRAVTIME              3.869096e-02
## BLUEBOOK             -4.230398e-03
## TIF                  -5.004348e-02
## CAR_TYPEPanel Truck   4.776375e-01
## CAR_TYPEPickup        4.281607e-01
## CAR_TYPESports Car    8.755467e-01
## CAR_TYPESUV           6.355374e-01
## CAR_TYPEVan           5.122177e-01
## OLDCLAIM              2.504929e-02
## CLM_FREQ              3.513050e-02
## MVR_PTS               1.068372e-01
## CAR_AGE              -2.727552e-03
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -1.660078e-01
## TIF_BIN              -1.261495e-01
## MALE1                 5.464939e-03
## MARRIED1             -4.532368e-01
## LIC_REVOKED1          6.960323e-01
## CAR_RED1              3.006562e-02
## PRIVATE_USE1         -8.083996e-01
## SINGLE_PARENT1        3.882197e-01
## URBAN1                2.285270e+00
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.464448e+00
## KIDSDRIV              3.295352e-01
## AGE                  -1.769623e-04
## HOMEKIDS              1.048500e-02
## YOJ                   .           
## INCOME               -3.680548e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -1.070104e-01
## EDUCATIONHigh School  7.383992e-02
## EDUCATIONMasters     -3.456691e-02
## EDUCATIONPhD          .           
## JOBClerical           3.677502e-02
## JOBDoctor            -7.665609e-02
## JOBHome Maker         .           
## JOBLawyer             .           
## JOBManager           -5.667630e-01
## JOBProfessional       .           
## JOBStudent            .           
## JOBUnknown            .           
## TRAVTIME              2.797189e-02
## BLUEBOOK             -3.228510e-03
## TIF                  -5.098952e-02
## CAR_TYPEPanel Truck   .           
## CAR_TYPEPickup        6.408488e-02
## CAR_TYPESports Car    4.575649e-01
## CAR_TYPESUV           2.774664e-01
## CAR_TYPEVan           3.657761e-02
## OLDCLAIM              2.599506e-02
## CLM_FREQ              1.974628e-02
## MVR_PTS               9.972623e-02
## CAR_AGE              -7.803403e-03
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -1.642793e-01
## TIF_BIN              -8.078956e-02
## MALE1                 .           
## MARRIED1             -3.288395e-01
## LIC_REVOKED1          6.046503e-01
## CAR_RED1              .           
## PRIVATE_USE1         -7.491766e-01
## SINGLE_PARENT1        3.783721e-01
## URBAN1                1.913118e+00
## 
## Call:  cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial", link = "logit",      standardize = TRUE, alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##       Lambda Index Measure       SE Nonzero
## min 0.001033    51  0.9039 0.010248      37
## 1se 0.007285    30  0.9132 0.009009      28

## $deviance
## lambda.min 
##  0.8930299 
## attr(,"measure")
## [1] "Binomial Deviance"
## 
## $class
## lambda.min 
##  0.2074142 
## attr(,"measure")
## [1] "Misclassification Error"
## 
## $auc
## [1] 0.8143323
## attr(,"measure")
## [1] "AUC"
## 
## $mse
## lambda.min 
##  0.2897296 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##  0.5843315 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1628.926
## 
## $BIC
## [1] -1378.357
## $AICc
## [1] -1565.658
## 
## $BIC
## [1] -1375.96

Model Performance

##          True
## Predicted    0   1 Total
##     0     1102 244  1346
##     1      100 187   287
##     Total 1202 431  1633
## 
##  Percent Correct:  0.7893
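The headline metrics can be recomputed directly from this confusion matrix (rows = predicted, columns = true), which also recovers the F1 score reported in the model selection section:

```r
# Confusion matrix from the validation set (rows: predicted, cols: true)
cm <- matrix(c(1102, 100, 244, 187), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), True = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)        # 1289 / 1633 = 0.7893
precision <- cm["1", "1"] / sum(cm["1", ])  # 187 / 287
recall    <- cm["1", "1"] / sum(cm[, "1"])  # 187 / 431 (sensitivity)
f1 <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, f1 = f1), 4)   # 0.7893, 0.5209
```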

Checking Model Assumptions

Lasso regression mitigates multicollinearity: among a group of highly correlated predictors, the L1 penalty tends to retain one and shrink the others’ coefficients toward (or exactly to) zero.

4.2 Model 2 -

4.3 Model 3 -

4.4 Model selection

**Model 3.1: Lasso Linear Regression**

**Model 4.1: Lasso Logistic Regression**

##        F1 
## 0.5208914

Regression

| Model | mape | smape | mase | mpe | rmse | AICc |
|---|---|---|---|---|---|---|
| M3.1: Lasso Linear | 70.4935 | 110.106 | 0.7764 | 66.0545 | 5.0267 | -48241.06 |

Logistic

| Model | Accuracy | Classification error rate | F1 | Deviance | R2 | Sensitivity | Specificity | Precision | AICc |
|---|---|---|---|---|---|---|---|---|---|
| M4.1: Lasso Logistic | 0.8091 | 0.1909 | 0.5209 | 0.9021 | NA | 0.4339 | 0.9168 | 0.6516 | -1628.926 |

5. Predictions

6. Conclusion

7. References

Appendix: R code